Junhao Cai, Jun Cen, Haokun Wang, Michael Yu Wang

The Hong Kong University of Science and Technology

Accepted by IEEE Robotics and Automation Letters (RA-L), 2022


Abstract

In this paper, we propose a novel vision-based grasp system for closed-loop 6-degree-of-freedom (DoF) grasping of unknown objects in cluttered environments. The key to our system is a geometry-aware scene representation based on a truncated signed distance function (TSDF) volume, which fuses multi-view observations from the vision sensor, provides comprehensive spatial information to the grasp pose detector, and enables collision checking to obtain collision-free grasp poses.

To eliminate the heavy computational burden of volumetric data, we propose a lightweight volume-point network (VPN), combined with the Marching Cubes algorithm, that predicts point-wise grasp qualities for all candidates in a single feed-forward pass in real time, enabling the system to perform closed-loop grasping. Furthermore, a grasp pose refinement module is integrated to predict a pose residual based on the SDF observation of the gripper state in the TSDF volume.

Extensive experiments show that the proposed method achieves collision-free grasp detection at more than 30 Hz on a high-resolution volume. Furthermore, the model, trained only on synthetic data, achieves a 90.9% grasp success rate in cluttered real-world scenes, significantly outperforming the baseline methods.


Response to reviewers

1. Complete table of evaluation on antipodal score and collision-free rate

To further demonstrate generalization to novel objects, we compare the primitive-shape objects with the Kit and Procedural object sets in the simulator, using antipodal score and collision-free rate as metrics. The experimental setting is the same as that described in Sec. IV.D of the paper, and the results are reported in Table 1. The results show that 1) the proposed method achieves the best performance on primitive-shape objects, since these are known objects for the trained model, and 2) it also maintains high antipodal scores and collision-free rates on the two novel object sets, which demonstrates that our method generalizes to unseen objects.
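For clarity, the snippet below sketches one common way to compute an antipodal score for a two-finger grasp from the two contact points and their inward surface normals; the function name and the exact scoring rule are illustrative assumptions and may differ from the definition used in the paper.

    import numpy as np

    def antipodal_score(c1, n1, c2, n2):
        # c1, c2: 3D contact points on the object surface.
        # n1, n2: unit inward surface normals at the two contacts.
        # Returns a value in [0, 1]; 1 means the closing axis is perfectly
        # aligned with both contact normals (an ideal antipodal grasp).
        axis = c2 - c1
        axis = axis / (np.linalg.norm(axis) + 1e-9)   # gripper closing axis
        a1 = np.dot(n1, axis)     # alignment of the first contact normal
        a2 = np.dot(n2, -axis)    # alignment of the second contact normal
        return float(max(0.0, min(a1, a2)))

A higher score means both contact normals lie closer to the closing direction, which correlates with a more stable antipodal grasp.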

2. More clarification about the pose ambiguity problem

In this work, the orientation component of each gripper pose is defined by a unit vector representing the approach direction of the gripper, together with a discretized angle around this vector that determines the last degree of freedom of the orientation. The reason for this definition is that most methods [3-5] can only predict a single 6-DoF pose per position candidate, which may introduce ambiguity when multiple orientations at that grasp position are all feasible for grasping the object.
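For concreteness, the snippet below sketches one possible way to construct a full rotation matrix from an approach vector and a discretized in-plane angle; the reference vector used to complete the basis and the number of angle bins are assumptions for illustration, not necessarily the convention used in the paper.

    import numpy as np

    def rotation_from_approach(approach, angle_index, num_bins=12):
        # Build a gripper rotation from an approach direction and a
        # discretized angle around that direction (illustrative sketch).
        z = approach / np.linalg.norm(approach)        # approach axis
        ref = np.array([1.0, 0.0, 0.0])                # arbitrary reference
        if abs(np.dot(ref, z)) > 0.9:                  # avoid near-parallel case
            ref = np.array([0.0, 1.0, 0.0])
        x = np.cross(ref, z)
        x = x / np.linalg.norm(x)                      # provisional in-plane axis
        y = np.cross(z, x)                             # right-handed frame
        theta = 2.0 * np.pi * angle_index / num_bins   # discretized angle
        x_rot = np.cos(theta) * x + np.sin(theta) * y  # rotate in-plane axes
        y_rot = np.cross(z, x_rot)
        return np.stack([x_rot, y_rot, z], axis=1)     # columns: x, y, z axes

Each angle_index then corresponds to one discrete orientation candidate sharing the same approach direction.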

The following figure illustrates an example of the graspable pose set for a specific gripper position on a tennis ball. Given the basic frame, shown as solid arrows with the z axis passing through the center of the ball, any rotated frame generated by rotating the basic frame around the z axis is also a feasible grasp orientation.

We can see that, for the same grasp position, there are infinitely many orientations with which the gripper can successfully grasp the tennis ball, so the pose definition in [3-5] leads to ambiguity when generating the data and training the model, since their output can represent only one pose. By contrast, angle discretization absorbs all possible solutions into different discretized angles, which effectively avoids this problem. Intuitively, this orientation parameterization mitigates the ambiguity, which speeds up training and improves the quality prediction.

However, angle discretization may introduce quantization error, because the discretized angle may not be the exact angle needed to grasp the object. To address this problem, we further propose the grasp pose refinement network, which iteratively refines the candidate pose selected by the volume-point network and thus improves grasp quality.
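As an illustration of how a predicted residual could be applied to a candidate pose during refinement, a minimal sketch is given below; the 6-vector residual parameterization and the variable names are assumptions and do not reflect the actual network or implementation.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def apply_pose_residual(pose, residual):
        # pose:     4x4 homogeneous gripper pose selected by the VPN.
        # residual: 6-vector (dx, dy, dz, rx, ry, rz); translation in meters,
        #           rotation as an axis-angle vector (assumed parameterization).
        new_pose = pose.copy()
        new_pose[:3, :3] = Rotation.from_rotvec(residual[3:]).as_matrix() @ pose[:3, :3]
        new_pose[:3, 3] = pose[:3, 3] + residual[:3]
        return new_pose

In the pipeline, such a residual would be predicted from the SDF observation of the gripper state and applied for a small number of iterations.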

In short, the motivation for this design is to avoid the potential ambiguity caused by regressing only a single pose for each grasp position candidate, while the grasp pose refinement module is leveraged to refine the gripper pose selected by the VPN and reduce the side effect of quantization error. More description has been added to Sec. I and Sec. III.D of the paper.


3. Noisy Depth Maps
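As a reference for this item, the snippet below sketches how depth-dependent Gaussian noise can be injected into a synthetic depth map, in the spirit of the Kinect noise model of [1]; the coefficients and the function name are illustrative placeholders rather than values taken from [1] or used in our experiments.

    import numpy as np

    def add_depth_noise(depth, sigma_base=0.001, sigma_quad=0.002, rng=None):
        # depth: HxW depth map in meters (0 marks invalid pixels).
        # The per-pixel noise standard deviation grows quadratically with
        # depth; the coefficients here are illustrative placeholders.
        rng = np.random.default_rng() if rng is None else rng
        sigma = sigma_base + sigma_quad * np.square(depth)
        noisy = depth + rng.normal(0.0, 1.0, depth.shape) * sigma
        noisy[depth <= 0] = 0.0        # keep invalid pixels invalid
        return noisy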

4. Visualization of Meshes Extracted by Marching Cubes
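For reference, a minimal sketch of extracting the zero level set of a TSDF volume as a triangle mesh with the Marching Cubes implementation in scikit-image is shown below; the variable names and voxel size are assumptions, and the paper's implementation may differ.

    import numpy as np
    from skimage import measure

    def tsdf_to_mesh(tsdf, voxel_size):
        # tsdf:       DxHxW array of truncated signed distances.
        # voxel_size: edge length of one voxel in meters.
        # Returns mesh vertices (in meters), triangle faces, and vertex normals.
        verts, faces, normals, _ = measure.marching_cubes(
            tsdf, level=0.0, spacing=(voxel_size,) * 3)
        return verts, faces, normals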

5. Trilinear Interpolation
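For completeness, the snippet below sketches standard trilinear interpolation of a TSDF volume at a continuous query point given in voxel coordinates; this is the textbook formulation and not necessarily the exact implementation used in the paper.

    import numpy as np

    def trilinear_sdf(tsdf, p):
        # tsdf: DxHxW array of truncated signed distances.
        # p:    query point in continuous voxel coordinates, shape (3,).
        p0 = np.floor(p).astype(int)
        p0 = np.clip(p0, 0, np.array(tsdf.shape) - 2)   # stay inside the grid
        t = p - p0                                      # fractional offsets
        value = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    w = ((1 - t[0]) if dx == 0 else t[0]) \
                        * ((1 - t[1]) if dy == 0 else t[1]) \
                        * ((1 - t[2]) if dz == 0 else t[2])
                    value += w * tsdf[p0[0] + dx, p0[1] + dy, p0[2] + dz]
        return value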

6. Failure Cases

(Figure: representative failure cases; two are caused by collision and two by slipping.)

7. TSDF Fusion

(Figure: reconstructed TSDF volumes after fusing 1, 10, and 20 depth frames.)
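For reference, the snippet below sketches the core running weighted-average update of the classical TSDF fusion method of [2] for a single voxel; the per-frame signed distances are assumed to be precomputed by projecting the voxel into each depth image, and the function is an illustrative simplification rather than the actual fusion implementation.

    def fuse_voxel(d_values, trunc, w_new=1.0):
        # d_values: signed distances from the voxel to the observed surface,
        #           one per depth frame (projection step omitted here).
        # trunc:    truncation distance in meters.
        D, W = 0.0, 0.0
        for d in d_values:
            if d < -trunc:                      # far behind the surface: skip
                continue
            d_trunc = min(1.0, d / trunc)       # truncate and normalize
            D = (W * D + w_new * d_trunc) / (W + w_new)
            W += w_new
        return D, W

Fusing more frames increases the accumulated weight and averages out sensor noise, which is why fusing 10 or 20 frames typically yields a cleaner volume than a single frame.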

References

[1] C. V. Nguyen, S. Izadi, and D. Lovell, “Modeling Kinect sensor noise for improved 3D reconstruction and tracking,” in The International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission. IEEE, 2012.

[2] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996.

[3] Y. Qin, R. Chen, H. Zhu, M. Song, J. Xu, and H. Su, “S4G: Amodal single-view single-shot SE(3) grasp detection in cluttered scenes,” in Conference on Robot Learning (CoRL). PMLR, 2019.

[4] P. Ni, W. Zhang, X. Zhu, and Q. Cao, “PointNet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds,” in The International Conference on Robotics and Automation (ICRA). IEEE, 2020.

[5] M. Breyer, J. J. Chung, L. Ott, R. Siegwart, and J. Nieto, “Volumetric grasping network: Real-time 6 DOF grasp detection in clutter,” in Conference on Robot Learning (CoRL). PMLR, 2020.

[6] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Yale-CMU-Berkeley dataset for robotic manipulation research,” The International Journal of Robotics Research, 2017.

[7] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, 2019.

[8] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “GraspNet-1Billion: A large-scale benchmark for general object grasping,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.